Goto

Collaborating Authors

 Valladolid Province


A HEART for the environment: Transformer-Based Spatiotemporal Modeling for Air Quality Prediction

arXiv.org Artificial Intelligence

Accurate and reliable air pollution forecasting is crucial for effective environmental management and policy-making. llull-environment is a sophisticated and scalable forecasting system for air pollution, inspired by previous models currently operational in Madrid and Valladolid (Spain). It contains (among other key components) an encoder-decoder convolutional neural network to forecast mean pollution levels for four key pollutants (NO$_2$, O$_3$, PM$_{10}$, PM$_{2.5}$) using historical data, external forecasts, and other contextual features. This paper investigates the augmentation of this neural network with an attention mechanism to improve predictive accuracy. The proposed attention mechanism pre-processes tensors containing the input features before passing them to the existing mean forecasting model. The resulting model is a combination of several architectures and ideas and can be described as a "Hybrid Enhanced Autoregressive Transformer", or HEART. The effectiveness of the approach is evaluated by comparing the mean square error (MSE) across different attention layouts against the system without such a mechanism. We observe a significant reduction in MSE of up to 22%, with an average of 7.5% across tested cities and pollutants. The performance of a given attention mechanism turns out to depend on the pollutant, highlighting the differences in their creation and dissipation processes. Our findings are not restricted to optimizing air quality prediction models, but are applicable generally to (fixed length) time series forecasting.


Learning with Differentially Private (Sliced) Wasserstein Gradients

arXiv.org Artificial Intelligence

In this work, we introduce a novel framework for privately optimizing objectives that rely on Wasserstein distances between data-dependent empirical measures. Our main theoretical contribution is, based on an explicit formulation of the Wasserstein gradient in a fully discrete setting, a control on the sensitivity of this gradient to individual data points, allowing strong privacy guarantees at minimal utility cost. Building on these insights, we develop a deep learning approach that incorporates gradient and activations clipping, originally designed for DP training of problems with a finite-sum structure. We further demonstrate that privacy accounting methods extend to Wasserstein-based objectives, facilitating large-scale private training. Empirical results confirm that our framework effectively balances accuracy and privacy, offering a theoretically sound solution for privacy-preserving machine learning tasks relying on optimal transport distances such as Wasserstein distance or sliced-Wasserstein distance.


A Learnable Multi-views Contrastive Framework with Reconstruction Discrepancy for Medical Time-Series

arXiv.org Artificial Intelligence

In medical time series disease diagnosis, two key challenges are identified. First, the high annotation cost of medical data leads to overfitting in models trained on label-limited, single-center datasets. To address this, we propose incorporating external data from related tasks and leveraging AE-GAN to extract prior knowledge, providing valuable references for downstream tasks. Second, many existing studies employ contrastive learning to derive more generalized medical sequence representations for diagnostic tasks, usually relying on manually designed diverse positive and negative sample pairs. However, these approaches are complex, lack generalizability, and fail to adaptively capture disease-specific features across different conditions. To overcome this, we introduce LMCF (Learnable Multi-views Contrastive Framework), a framework that integrates a multi-head attention mechanism and adaptively learns representations from different views through inter-view and intra-view contrastive learning strategies. Additionally, the pre-trained AE-GAN is used to reconstruct discrepancies in the target data as disease probabilities, which are then integrated into the contrastive learning process. Experiments on three target datasets demonstrate that our method consistently outperforms other seven baselines, highlighting its significant impact on healthcare applications such as the diagnosis of myocardial infarction, Alzheimer's disease, and Parkinson's disease. We release the source code at xxxxx.


Towards an Operational Responsible AI Framework for Learning Analytics in Higher Education

arXiv.org Artificial Intelligence

Universities are increasingly adopting data-driven strategies to enhance student success, with AI applications like Learning Analytics (LA) and Predictive Learning Analytics (PLA) playing a key role in identifying at-risk students, personalising learning, supporting teachers, and guiding educational decision-making. However, concerns are rising about potential harms these systems may pose, such as algorithmic biases leading to unequal support for minority students. While many have explored the need for Responsible AI in LA, existing works often lack practical guidance for how institutions can operationalise these principles. In this paper, we propose a novel Responsible AI framework tailored specifically to LA in Higher Education (HE). We started by mapping 11 established Responsible AI frameworks, including those by leading tech companies, to the context of LA in HE. This led to the identification of seven key principles such as transparency, fairness, and accountability. We then conducted a systematic review of the literature to understand how these principles have been applied in practice. Drawing from these findings, we present a novel framework that offers practical guidance to HE institutions and is designed to evolve with community input, ensuring its relevance as LA systems continue to develop.


Detecci\'on Autom\'atica de Patolog\'ias en Notas Cl\'inicas en Espa\~nol Combinando Modelos de Lenguaje y Ontolog\'ias M\'edicos

arXiv.org Artificial Intelligence

In this paper we present a hybrid method for the automatic detection of dermatological pathologies in medical reports. We use a large language model combined with medical ontologies to predict, given a first appointment or follow-up medical report, the pathology a person may suffer from. The results show that teaching the model to learn the type, severity and location on the body of a dermatological pathology as well as in which order it has to learn these three features significantly increases its accuracy. The article presents the demonstration of state-of-the-art results for classification of medical texts with a precision of 0.84, micro and macro F1-score of 0.82 and 0.75, and makes both the method and the dataset used available to the community.


Diffusion Models for Tabular Data Imputation and Synthetic Data Generation

arXiv.org Artificial Intelligence

Data imputation and data generation have important applications for many domains, like healthcare and finance, where incomplete or missing data can hinder accurate analysis and decision-making. Diffusion models have emerged as powerful generative models capable of capturing complex data distributions across various data modalities such as image, audio, and time series data. Recently, they have been also adapted to generate tabular data. In this paper, we propose a diffusion model for tabular data that introduces three key enhancements: (1) a conditioning attention mechanism, (2) an encoder-decoder transformer as the denoising network, and (3) dynamic masking. The conditioning attention mechanism is designed to improve the model's ability to capture the relationship between the condition and synthetic data. The transformer layers help model interactions within the condition (encoder) or synthetic data (decoder), while dynamic masking enables our model to efficiently handle both missing data imputation and synthetic data generation tasks within a unified framework. We conduct a comprehensive evaluation by comparing the performance of diffusion models with transformer conditioning against state-of-the-art techniques, such as Variational Autoencoders, Generative Adversarial Networks and Diffusion Models, on benchmark datasets. Our evaluation focuses on the assessment of the generated samples with respect to three important criteria, namely: (1) Machine Learning efficiency, (2) statistical similarity, and (3) privacy risk mitigation. For the task of data imputation, we consider the efficiency of the generated samples across different levels of missing features.


DETECTA 2.0: Research into non-intrusive methodologies supported by Industry 4.0 enabling technologies for predictive and cyber-secure maintenance in SMEs

arXiv.org Artificial Intelligence

The integration of predictive maintenance and cybersecurity represents a transformative advancement for small and medium-sized enterprises (SMEs) operating within the Industry 4.0 paradigm. Despite their economic importance, SMEs often face significant challenges in adopting advanced technologies due to resource constraints and knowledge gaps. The DETECTA 2.0 project addresses these hurdles by developing an innovative system that harmonizes real-time anomaly detection, sophisticated analytics, and predictive forecasting capabilities. The system employs a semi-supervised methodology, combining unsupervised anomaly detection with supervised learning techniques. This approach enables more agile and cost-effective development of AI detection systems, significantly reducing the time required for manual case review. At the core lies a Digital Twin interface, providing intuitive real-time visualizations of machine states and detected anomalies. Leveraging cutting-edge AI engines, the system intelligently categorizes anomalies based on observed patterns, differentiating between technical errors and potential cybersecurity incidents. This discernment is fortified by detailed analytics, including certainty levels that enhance alert reliability and minimize false positives. The predictive engine uses advanced time series algorithms like N-HiTS to forecast future machine utilization trends. This proactive approach optimizes maintenance planning, enhances cybersecurity measures, and minimizes unplanned downtimes despite variable production processes. With its modular architecture enabling seamless integration across industrial setups and low implementation costs, DETECTA 2.0 presents an attractive solution for SMEs to strengthen their predictive maintenance and cybersecurity strategies.


Application of Machine Learning Algorithms in Classifying Postoperative Success in Metabolic Bariatric Surgery: A Comprehensive Study

arXiv.org Artificial Intelligence

Objectives: Metabolic Bariatric Surgery (MBS) is a critical intervention for patients living with obesity and related health issues. Accurate classification and prediction of patient outcomes are vital for optimizing treatment strategies. This study presents a novel machine learning approach to classify patients in the context of metabolic bariatric surgery, providing insights into the efficacy of different models and variable types. Methods: Various machine learning models, including GaussianNB, ComplementNB, KNN, Decision Tree, KNN with RandomOverSampler, and KNN with SMOTE, were applied to a dataset of 73 patients. The dataset, comprising psychometric, socioeconomic, and analytical variables, was analyzed to determine the most efficient predictive model. The study also explored the impact of different variable groupings and oversampling techniques. Results: Experimental results indicate average accuracy values as high as 66.7% for the best model. Enhanced versions of KNN and Decision Tree, along with variations of KNN such as RandomOverSampler and SMOTE, yielded the best results. Conclusions: The study unveils a promising avenue for classifying patients in the realm of metabolic bariatric surgery. The results underscore the importance of selecting appropriate variables and employing diverse approaches to achieve optimal performance. The developed system holds potential as a tool to assist healthcare professionals in decision-making, thereby enhancing metabolic bariatric surgery outcomes. These findings lay the groundwork for future collaboration between hospitals and healthcare entities to improve patient care through the utilization of machine learning algorithms. Moreover, the findings suggest room for improvement, potentially achievable with a larger dataset and careful parameter tuning.


Open Source Conversational LLMs do not know most Spanish words

arXiv.org Artificial Intelligence

The growing interest in Large Language Models (LLMs) and in particular in conversational models with which users can interact has led to the development of a large number of open-source chat LLMs. These models are evaluated on a wide range of benchmarks to assess their capabilities in answering questions or solving problems on almost any possible topic or to test their ability to reason or interpret texts. Instead, the evaluation of the knowledge that these models have of the languages has received much less attention. For example, the words that they can recognize and use in different languages. In this paper, we evaluate the knowledge that open-source chat LLMs have of Spanish words by testing a sample of words in a reference dictionary. The results show that open-source chat LLMs produce incorrect meanings for an important fraction of the words and are not able to use most of the words correctly to write sentences with context. These results show how Spanish is left behind in the open-source LLM race and highlight the need to push for linguistic fairness in conversational LLMs ensuring that they provide similar performance across languages.


Beware of Words: Evaluating the Lexical Richness of Conversational Large Language Models

arXiv.org Artificial Intelligence

The performance of conversational Large Language Models (LLMs) in general, and of ChatGPT in particular, is currently being evaluated on many different tasks, from logical reasoning or maths to answering questions on a myriad of topics. Instead, much less attention is being devoted to the study of the linguistic features of the texts generated by these LLMs. This is surprising since LLMs are models for language, and understanding how they use the language is important. Indeed, conversational LLMs are poised to have a significant impact on the evolution of languages as they may eventually dominate the creation of new text. This means that for example, if conversational LLMs do not use a word it may become less and less frequent and eventually stop being used altogether. Therefore, evaluating the linguistic features of the text they produce and how those depend on the model parameters is the first step toward understanding the potential impact of conversational LLMs on the evolution of languages. In this paper, we consider the evaluation of the lexical richness of the text generated by LLMs and how it depends on the model parameters. A methodology is presented and used to conduct a comprehensive evaluation of lexical richness using ChatGPT as a case study. The results show how lexical richness depends on the version of ChatGPT and some of its parameters, such as the presence penalty, or on the role assigned to the model. The dataset and tools used in our analysis are released under open licenses with the goal of drawing the much-needed attention to the evaluation of the linguistic features of LLM-generated text.